Abstract. The LETOR website contains three information retrieval
datasets used as a benchmark for testing machine learning ideas for 
ranking. Algorithms participating in the challenge are required to assign
score values to search results for a collection of queries, and are
measured using standard IR ranking measures (NDCG, precision, MAP)
that depend only on the relative score-induced order of the results. Similarly
to many of the ideas proposed in the participating algorithms, we
train a linear classifier. In contrast with other participating algorithms, 
we define an additional free variable (intercept, or benchmark) for each 
query. This allows expressing the fact that results for different queries are 
incomparable for the purpose of determining relevance. The cost of this 
idea is the addition of relatively few nuisance parameters. Our approach 
is simple, and we used a standard logistic regression library to test it. 
The results beat the reported participating algorithms. Hence, it seems 
promising to combine our approach with other more complex ideas. 

1 Introduction 



The LETOR benchmark dataset [5] (http://research.microsoft.com/users/LETOR/, version 2.0)
contains three information retrieval datasets used as a benchmark
for testing machine learning ideas for ranking. Algorithms participating in the 
challenge are required to assign score values to search results for a collection 
of queries, and are measured using standard IR ranking measures (NDCG@n, 
precision@n and MAP - see [S] for details), designed in such a way that only 
the relative order of the results matters. The input to the learning problem is 
a list of query-result records, where each record is a vector of standard IR fea- 
tures together with a relevance label and a query id. The label is either binary 
(irrelevant or relevant) or trinary (irrelevant, relevant or very relevant). 

All reported algorithms used for this task on the LETOR website [2,3,5,7,8,9]
rely on the fact that records corresponding to the same query id are in some 
sense comparable to each other, and cross query records are incomparable. The 
rationale is that the IR measures are computed as a sum over the queries, where 
for each query a nonlinear function is computed. For example, RankSVM [S] and 
RankBoost [3] use pairs of results for the same query to penalize a cost function, 
but never cross-query pairs of results. 

The following approach seems at first too naive compared to others: Since 
the training information is given as relevance labels, why not simply train a 
linear classifier to predict the relevance labels, and use prediction confidence as 
score? Unfortunately this approach fares poorly. The hypothesized reason is that 
judges' relevance response may depend on the query. To check this hypothesis, 
we define an additional free variable (intercept or benchmark) for each query.
This allows expressing the fact that results for different queries are incomparable
for the purpose of determining relevance. The cost of this idea is the addition of 
relatively few nuisance parameters. Our approach is extremely simple, and we 
used a standard logistic regression library to test it on the data. This work is 
not the first to suggest query dependent ranking, but it is arguably the simplest, 
most immediate way to address this dependence using linear classification before 
other complicated ideas should be tested. Based on our judgment, other reported 
algorithms used for the challenge are more complicated, and our solution is 
overall better on the given data. 

2 Theory and Experiments 

Let $Q_i$, $i = 1, \dots, n$ be a sequence of queries, and for each $i$ let $R_{i1}, \dots, R_{im_i}$
denote a corresponding set of retrieved results. For each $i \in [n]$ and $j \in [m_i]$
let $\phi_{ij} = (\phi_{ij}(1), \dots, \phi_{ij}(k)) \in \mathbb{R}^k$ denote a real valued feature vector. Here, the
coordinates of $\phi_{ij}$ are standard IR features. Some of these features depend on the
result only, and some on the query-result pair, as explained in [6]. Also assume
that for each $i, j$ there is a judge's response label $L_{ij} \in O$, where $O$ is a finite
set of ordinals. In the TREC datasets (TD2003 and TD2004), $O = \{0, 1\}$. In the
OHSUMED dataset $O = \{0, 1, 2\}$. Higher numbers represent higher relevance.

The Model. We assume the following generalized linear model for $L_{ij}$ given
$\phi_{ij}$ using the logit link. Other models are possible, but we chose this one for
simplicity. Assume first that the set of ordinals is binary: $O = \{0, 1\}$. There is a
hidden global weight vector $w \in \mathbb{R}^k$. Aside from $w$, there is a query dependent
parameter $\Theta_i \in \mathbb{R}$ corresponding to each query $Q_i$. We call this parameter
a benchmark or an intercept. The intuition behind defining this parameter is
to allow for a different relevance criterion for different queries. The probability
distribution $\Pr_{w,\Theta_i}(L_{ij} \mid Q_i, R_{ij})$ of response to result $j$ for query $i$ is given by

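$$\Pr_{w,\Theta_i}(L_{ij} = 1 \mid Q_i, R_{ij}) = \frac{1}{1 + e^{\Theta_i - w\cdot\phi_{ij}}}, \qquad \Pr_{w,\Theta_i}(L_{ij} = 0 \mid Q_i, R_{ij}) = \frac{1}{1 + e^{w\cdot\phi_{ij} - \Theta_i}}.$$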

In words, the probability of result $j$ for query $i$ being deemed relevant is $\Theta_i - w\cdot\phi_{ij}$
passed through the logit link, where $w\cdot\phi_{ij}$ is vector dot product. This process
should be thought of as a statistical comparison between the value of a search
result $R_{ij}$ (obtained as a linear function of its feature vector $\phi_{ij}$) to a benchmark
$\Theta_i$. In our setting, both the linear coefficients $w$ and the benchmarks $\Theta_1, \dots, \Theta_n$
are variables which can be efficiently learnt in the maximum likelihood (supervised)
setting. Note that the total number of variables is $n$ (number of queries)
plus $k$ (number of features).
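
As a rough illustration of the maximum likelihood fit described above, the following sketch runs plain gradient ascent on the binary model with one benchmark per query. It is not the authors' implementation (they used a logistic regression routine in R); the function names, the toy data, the learning rate and the small L2 penalty are all assumptions made for this example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit(records, n_queries, n_features, lr=0.05, steps=2000, l2=0.01):
    """Gradient ascent on the log-likelihood of Pr(L=1) = sigmoid(w.phi - theta_q).

    records: list of (query_index, feature_list, label in {0, 1}).
    Returns (w, theta). The L2 penalty is an addition of this sketch; it keeps
    the parameters finite when the toy labels are perfectly separable.
    """
    w = [0.0] * n_features
    theta = [0.0] * n_queries
    for _ in range(steps):
        grad_w = [-l2 * wi for wi in w]
        grad_theta = [-l2 * t for t in theta]
        for q, phi, label in records:
            score = sum(wi * x for wi, x in zip(w, phi)) - theta[q]
            err = label - sigmoid(score)   # derivative of log-likelihood w.r.t. score
            for d, x in enumerate(phi):
                grad_w[d] += err * x
            grad_theta[q] -= err           # theta enters the score with a minus sign
        w = [wi + lr * g for wi, g in zip(w, grad_w)]
        theta = [t + lr * g for t, g in zip(theta, grad_theta)]
    return w, theta

# Made-up toy data: one feature, two queries; query 1 is judged more strictly.
records = [
    (0, [1.0], 0), (0, [2.0], 1), (0, [3.0], 1),
    (1, [3.0], 0), (1, [4.0], 0), (1, [5.0], 1),
]
w, theta = fit(records, n_queries=2, n_features=1)
```

On this toy data a single weight is shared by both queries, while the fitted benchmark of the second query comes out larger than that of the first, reflecting its stricter relevance criterion.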

Observation: For any weight vector $w$, benchmark variable $\Theta_i$ corresponding
to query $Q_i$ and two result indices $j, k$,

$$\Pr_{w,\Theta_i}(L_{ij} = 1 \mid Q_i, R_{ij}) > \Pr_{w,\Theta_i}(L_{ik} = 1 \mid Q_i, R_{ik}) \iff w\cdot\phi_{ij} > w\cdot\phi_{ik}.$$
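
The observation follows from the monotonicity of the logistic function. A short derivation, where the symbol $\sigma$ is notation introduced here for convenience and is not used in the original text:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\Pr_{w,\Theta_i}(L_{ij} = 1 \mid Q_i, R_{ij}) = \sigma(w\cdot\phi_{ij} - \Theta_i).

% sigma is strictly increasing, since its derivative is positive:
\sigma'(z) = \sigma(z)\,(1 - \sigma(z)) > 0.

% Hence, for two results j, k of the same query i:
\sigma(w\cdot\phi_{ij} - \Theta_i) > \sigma(w\cdot\phi_{ik} - \Theta_i)
\iff w\cdot\phi_{ij} - \Theta_i > w\cdot\phi_{ik} - \Theta_i
\iff w\cdot\phi_{ij} > w\cdot\phi_{ik}.
```

The benchmark cancels only because both results share the same query; for results of different queries the two benchmarks differ and no such equivalence holds.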

This last observation means that for the purpose of ranking candidate results for 
a specific query Qi in decreasing order of relevance likelihood, the benchmark 
parameter $\Theta_i$ is not needed. Indeed, in our experiments below the benchmark
variables will be used only in conjunction with the training data. In testing, this
variable will neither be known nor necessary. 

The Trinary Case. As stated above, the labels for the OHSUMED case are
trinary: $O = \{0, 1, 2\}$. We chose the following model to extend the binary case.
Instead of one benchmark parameter for each query $Q_i$ there are two such
parameters, $\Theta_i^H, \Theta_i^L$ (High/Low) with $\Theta_i^H > \Theta_i^L$. Given a candidate result $R_{ij}$ to
query $Q_i$ and the parameters, the probability distribution on the three possible
ordinals is:



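$$
\Pr_{w,\Theta_i^H,\Theta_i^L}(L_{ij} = X \mid Q_i, R_{ij}) =
\begin{cases}
\dfrac{1}{1 + e^{w\cdot\phi_{ij} - \Theta_i^H}} \cdot \dfrac{1}{1 + e^{w\cdot\phi_{ij} - \Theta_i^L}} & X = 0 \\[2ex]
\dfrac{1}{1 + e^{w\cdot\phi_{ij} - \Theta_i^H}} \cdot \dfrac{1}{1 + e^{\Theta_i^L - w\cdot\phi_{ij}}} & X = 1 \\[2ex]
\dfrac{1}{1 + e^{\Theta_i^H - w\cdot\phi_{ij}}} & X = 2
\end{cases}
$$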



In words, the result $R_{ij}$ is statistically compared against benchmark $\Theta_i^H$. If it
is deemed higher than the benchmark, the label 2 ("very relevant") is output
as response. Otherwise, the result is statistically compared against benchmark
$\Theta_i^L$ and the resulting comparison is either 0 (irrelevant) or 1 (relevant). The
model is inspired by Ailon and Mohri's QuickSort algorithm, proposed as a
learning method in their recent paper [T]: Pivot elements (or, benchmarks) are
used to iteratively refine the ranking of data.
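
The two-stage comparison described in words above can be written out as a small function. This is only a sketch: the name `trinary_probs` and the example numbers are made up for illustration, and `score` stands for the linear value of a result's feature vector under the weight vector.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def trinary_probs(score, theta_high, theta_low):
    """Return [Pr(L=0), Pr(L=1), Pr(L=2)] for a result with the given linear score.

    First the score is compared against the high benchmark; only if that
    comparison fails is it compared against the low benchmark.
    """
    p_above_high = sigmoid(score - theta_high)
    p_above_low = sigmoid(score - theta_low)
    p2 = p_above_high
    p1 = (1.0 - p_above_high) * p_above_low
    p0 = (1.0 - p_above_high) * (1.0 - p_above_low)
    return [p0, p1, p2]

# Hypothetical values: a score between the two benchmarks.
probs = trinary_probs(score=1.0, theta_high=2.0, theta_low=0.0)
```

The three probabilities sum to one by construction, and the probability of the top label depends only on the comparison against the high benchmark.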

Experiments. We used an out of the box implementation of logistic regression
in R to test the above ideas. Each one of the three datasets includes 5
folds of data, each fold consisting of training, validation (not used) and testing
data. From each training dataset, the variables $w$ and $\Theta_i$ (or $w, \Theta_i^H, \Theta_i^L$ in
the OHSUMED case) were recovered in the maximum likelihood sense (using
logistic regression). Note that the constraint $\Theta_i^H > \Theta_i^L$ was not enforced, but
was obtained as a byproduct. The weight vector $w$ was then used to score the
test data. The scores were passed through an evaluation tool provided by the
LETOR website.
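
The text does not say how the model was encoded for the logistic regression routine. One plausible encoding, offered purely as an assumption, is to expand every record into one or two binary observations and to represent each benchmark by an indicator column with value -1, so that an ordinary logistic regression without a global intercept recovers the weight vector followed by the high and low benchmarks. The helper name `expand_trinary` and the example records are hypothetical.

```python
def expand_trinary(records, n_queries):
    """Expand (query_index, features, label in {0, 1, 2}) into binary rows.

    Each returned row is (x, y) where
    x = features + [high-benchmark indicators] + [low-benchmark indicators].
    The indicator of the query's own benchmark is -1 because the model's
    linear score is w.phi minus the benchmark.
    """
    rows = []
    for q, phi, label in records:
        # First comparison, against the high benchmark: success iff label == 2.
        high = [0.0] * n_queries
        high[q] = -1.0
        rows.append((list(phi) + high + [0.0] * n_queries, 1 if label == 2 else 0))
        if label < 2:
            # Second comparison happens only if the first failed:
            # against the low benchmark, success iff label == 1.
            low = [0.0] * n_queries
            low[q] = -1.0
            rows.append((list(phi) + [0.0] * n_queries + low, 1 if label == 1 else 0))
    return rows

# Made-up records: two queries, one feature each.
example_records = [(0, [0.5], 2), (0, [0.1], 0), (1, [0.3], 1)]
rows = expand_trinary(example_records, n_queries=2)
```

Under this encoding the product of the per-row Bernoulli likelihoods equals the trinary likelihood of the original record, which is why a binary logistic regression routine would suffice. In the binary datasets only the first kind of row is needed.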

Results. The results for OHSUMED are summarized in Tables 1, 2 and 7. The
results for TD2003 are summarized in Tables 3, 4 and 7. The results for TD2004
are summarized in Tables 5, 6 and 7. The significance of each score separately
is quite small (as can be seen by the standard deviations), but it is clear that 
overall our method outperforms the others. For convenience, the winning average 
score (over 5 folds) is marked in red for each table column. 

Conclusions and further ideas
• In this work we showed that a simple out-of-the-box generalized linear model
using logistic regression performs at least as well as the state of the art in
learning ranking algorithms if a separate intercept variable (benchmark) is
defined for each query.
• In a more elaborate IR system, a separate intercept variable could be attached
to each pair of query × judge (indeed, in LETOR the separate judges' responses
were aggregated somehow, but in general



^ A natural alternative to this model is the following: Statistically compare against
$\Theta_i^L$ to decide if the result is irrelevant. If it is not irrelevant, compare against $\Theta_i^H$
to decide between relevant and very relevant. In practice, the model proposed above
gave better results.

               @2             @4             @6             @8             @10
This           0.491 ± 0.086  0.480 ± 0.058  0.458 ± 0.055  0.448 ± 0.054  0.447 ± 0.047
RankBoost      0.483 ± 0.079  0.461 ± 0.063  0.442 ± 0.058  0.436 ± 0.044  0.436 ± 0.042
RankSVM        0.476 ± 0.091  0.459 ± 0.059  0.455 ± 0.054  0.445 ± 0.057  0.441 ± 0.055
FRank          0.510 ± 0.074  0.478 ± 0.060  0.457 ± 0.062  0.445 ± 0.054  0.442 ± 0.055
ListNet        0.497 ± 0.062  0.468 ± 0.065  0.451 ± 0.056  0.451 ± 0.050  0.449 ± 0.040
AdaRank.MAP    0.496 ± 0.100  0.471 ± 0.075  0.448 ± 0.070  0.443 ± 0.058  0.438 ± 0.057
AdaRank.NDCG   0.474 ± 0.091  0.456 ± 0.057  0.442 ± 0.055  0.441 ± 0.048  0.437 ± 0.046

Table 1. OHSUMED: Mean ± Stdev for NDCG over 5 folds




               @2             @4             @6             @8             @10
This           0.610 ± 0.092  0.598 ± 0.082  0.560 ± 0.090  0.526 ± 0.092  0.511 ± 0.081
RankBoost      0.595 ± 0.090  0.562 ± 0.081  0.525 ± 0.093  0.505 ± 0.072  0.495 ± 0.081
RankSVM        0.619 ± 0.096  0.579 ± 0.072  0.558 ± 0.077  0.525 ± 0.088  0.507 ± 0.096
FRank          0.619 ± 0.051  0.581 ± 0.079  0.534 ± 0.098  0.501 ± 0.091  0.485 ± 0.097
ListNet        0.629 ± 0.080  0.577 ± 0.097  0.544 ± 0.098  0.520 ± 0.098  0.510 ± 0.085
AdaRank.MAP    0.605 ± 0.102  0.567 ± 0.087  0.528 ± 0.102  0.502 ± 0.087  0.491 ± 0.091
AdaRank.NDCG   0.605 ± 0.099  0.562 ± 0.063  0.529 ± 0.073  0.506 ± 0.073  0.491 ± 0.082

Table 2. OHSUMED: Mean ± Stdev for precision over 5 folds




               @2             @4             @6             @8             @10
This           0.430 ± 0.179  0.398 ± 0.146  0.375 ± 0.125  0.369 ± 0.113  0.360 ± 0.105
RankBoost      0.280 ± 0.097  0.272 ± 0.086  0.280 ± 0.071  0.282 ± 0.074  0.285 ± 0.064
RankSVM        0.370 ± 0.130  0.363 ± 0.132  0.341 ± 0.118  0.345 ± 0.117  0.341 ± 0.115
FRank          0.390 ± 0.143  0.342 ± 0.107  0.330 ± 0.087  0.332 ± 0.079  0.336 ± 0.074
ListNet        0.430 ± 0.160  0.386 ± 0.125  0.386 ± 0.106  0.373 ± 0.104  0.374 ± 0.094
AdaRank.MAP    0.320 ± 0.104  0.268 ± 0.120  0.229 ± 0.104  0.206 ± 0.093  0.194 ± 0.086
AdaRank.NDCG   0.410 ± 0.207  0.347 ± 0.195  0.309 ± 0.181  0.286 ± 0.171  0.270 ± 0.161

Table 3. TD2003: Mean ± Stdev for NDCG over 5 folds




               @2             @4             @6             @8             @10
This           0.420 ± 0.192  0.340 ± 0.161  0.283 ± 0.131  0.253 ± 0.115  0.222 ± 0.106
RankBoost      0.270 ± 0.104  0.230 ± 0.112  0.210 ± 0.080  0.193 ± 0.071  0.178 ± 0.053
RankSVM        0.350 ± 0.132  0.300 ± 0.137  0.243 ± 0.100  0.233 ± 0.091  0.206 ± 0.082
FRank          0.370 ± 0.148  0.260 ± 0.082  0.223 ± 0.043  0.210 ± 0.045  0.186 ± 0.049
ListNet        0.420 ± 0.164  0.310 ± 0.129  0.283 ± 0.090  0.240 ± 0.075  0.222 ± 0.061
AdaRank.MAP    0.310 ± 0.096  0.230 ± 0.105  0.163 ± 0.081  0.125 ± 0.064  0.102 ± 0.050
AdaRank.NDCG   0.400 ± 0.203  0.305 ± 0.183  0.237 ± 0.161  0.190 ± 0.140  0.156 ± 0.120

Table 4. TD2003: Mean ± Stdev for precision over 5 folds




               @2             @4             @6             @8             @10
This           0.473 ± 0.132  0.454 ± 0.075  0.450 ± 0.059  0.459 ± 0.050  0.472 ± 0.043
RankBoost      0.473 ± 0.055  0.439 ± 0.057  0.448 ± 0.052  0.461 ± 0.036  0.472 ± 0.034
RankSVM        0.433 ± 0.094  0.406 ± 0.086  0.397 ± 0.082  0.410 ± 0.074  0.420 ± 0.067
FRank          0.467 ± 0.113  0.435 ± 0.088  0.445 ± 0.078  0.455 ± 0.055  0.471 ± 0.057
ListNet        0.427 ± 0.080  0.422 ± 0.049  0.418 ± 0.057  0.449 ± 0.041  0.458 ± 0.036
AdaRank.MAP    0.393 ± 0.060  0.387 ± 0.086  0.399 ± 0.085  0.400 ± 0.086  0.406 ± 0.083
AdaRank.NDCG   0.360 ± 0.161  0.377 ± 0.123  0.378 ± 0.117  0.380 ± 0.102  0.388 ± 0.093

Table 5. TD2004: Mean ± Stdev for NDCG over 5 folds



it is likely that different judges would have different benchmarks as well).
• The simplicity of our approach is also its main limitation. However, it can easily be
implemented in conjunction with other ranking ideas. For example, recent work

               @2             @4             @6             @8             @10
This           0.447 ± 0.146  0.370 ± 0.095  0.316 ± 0.076  0.288 ± 0.076  0.264 ± 0.062
RankBoost      0.447 ± 0.056  0.347 ± 0.083  0.304 ± 0.079  0.277 ± 0.070  0.253 ± 0.067
RankSVM        0.407 ± 0.098  0.327 ± 0.089  0.273 ± 0.083  0.247 ± 0.082  0.225 ± 0.072
FRank          0.433 ± 0.115  0.340 ± 0.098  0.311 ± 0.082  0.273 ± 0.071  0.256 ± 0.071
ListNet        0.407 ± 0.086  0.357 ± 0.087  0.307 ± 0.084  0.287 ± 0.069  0.257 ± 0.059
AdaRank.MAP    0.353 ± 0.045  0.300 ± 0.086  0.282 ± 0.068  0.242 ± 0.063  0.216 ± 0.064
AdaRank.NDCG   0.320 ± 0.139  0.300 ± 0.082  0.262 ± 0.092  0.232 ± 0.086  0.207 ± 0.082

Table 6. TD2004: Mean ± Stdev for precision over 5 folds





               OHSUMED        TD2003         TD2004
This           0.445 ± 0.065  0.248 ± 0.075  0.379 ± 0.051
RankBoost      0.440 ± 0.062  0.212 ± 0.047  0.384 ± 0.043
RankSVM        0.447 ± 0.067  0.256 ± 0.083  0.350 ± 0.072
FRank          0.446 ± 0.062  0.245 ± 0.065  0.381 ± 0.069
ListNet        0.450 ± 0.063  0.273 ± 0.068  0.372 ± 0.046
AdaRank.MAP    0.442 ± 0.061  0.137 ± 0.063  0.331 ± 0.089
AdaRank.NDCG   0.442 ± 0.058  0.185 ± 0.105  0.299 ± 0.088

Table 7. Mean ± Stdev for MAP over 5 folds



by Geng et al. [4] (not evaluated on LETOR) proposes query dependent ranking, 
where the category of a query is determined using a k-Nearest Neighbor method. 
It is immediate to apply the ideas here within each category. 